Instructions for the benchmark datasets

1, Datasets for general phosphorylation site prediction

For training and testing models in general phosphorylation site prediction, we used the training and independent test 
datasets that were originally prepared in PhosphoSVM: 

Dou, Y., Yao, B., & Zhang, C. (2014). PhosphoSVM: prediction of phosphorylation sites by integrating various protein 
sequence attributes with a support vector machine. Amino acids, 46(6), 1459-1469.

The training and testing datasets for PhosphoSVM can be downloaded from 
http://sysbio.unl.edu/PhosphoSVM/download.php. 


2, Datasets for kinase-specific phosphorylation site prediction

2.1 Dataset Construction

For training and testing models in kinase-specific phosphorylation site prediction, we constructed benchmark datasets 
from the Swiss-Prot/UniProt database and the Phospho.ELM database. As a result, training and testing datasets were 
constructed for 8 groups, 37 families, 45 subfamilies, and 48 protein kinases (138 in total). Table S1 provides a 
statistical summary of the numbers of positive and negative sites in the training and testing datasets of the 5 kinase 
families reported in the manuscript. Table S2 provides a statistical summary of the numbers of positive and negative 
sites in the training and testing datasets of the 138 kinase groups, families, subfamilies and protein kinases.

2.2 Structure of the datasets

The training and testing datasets are located in the directories ‘Combined_train’ and ‘Combined_test’, respectively. 
In each of these two directories, the dataset of each kinase group, family, subfamily or protein kinase is located 
under the directory of its parent node in the hierarchical kinase classification tree. For example, the training 
dataset of protein kinase AGC/PKC/Alpha/PKCA is located under the directory path ‘Combined_train/AGC/PKC/Alpha/PKCA’. 

For each kinase group, family, subfamily or protein kinase, the training or testing dataset is composed of the 
following three parts: 

- ID.txt, a list of the protein identifiers. 
- chain, a directory containing the amino acid sequence of each protein listed in ID.txt (FASTA format). 
- site, a directory containing the phosphosite labelling of each protein listed in ID.txt, where 1 denotes a 
phosphosite and 0 denotes a non-phosphosite. 
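
As an illustration only, the layout described above could be read with a few small Python helpers. This is a minimal 
sketch, not the code used to build the datasets. The function names are hypothetical, and two points are assumptions 
that should be checked against the actual files: that each label file holds one 0/1 character per residue, and that 
lines starting with ‘>’ are headers. The naming of the files inside ‘chain’ and ‘site’ is not specified here, so the 
helpers take file contents or explicit paths rather than guessing file names.

```python
from pathlib import Path


def dataset_dir(root, split, kinase_path):
    """Build the directory of one kinase, e.g. Combined_train/AGC/PKC/Alpha/PKCA.

    split is "train" or "test"; kinase_path uses the group/family/subfamily/kinase form.
    """
    return Path(root) / f"Combined_{split}" / Path(*kinase_path.split("/"))


def read_ids(id_file):
    """Return the non-empty lines of ID.txt as protein identifiers."""
    lines = Path(id_file).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def parse_fasta(text):
    """Return the sequence of a single-record FASTA string, skipping header lines."""
    return "".join(
        line.strip() for line in text.splitlines() if line and not line.startswith(">")
    )


def parse_labels(text):
    """Return per-residue labels as integers.

    Assumption: only the characters 0 and 1 carry meaning, and header lines start with '>'.
    """
    labels = []
    for line in text.splitlines():
        if line.startswith(">"):
            continue
        labels.extend(int(ch) for ch in line if ch in "01")
    return labels


# Toy example with a made-up sequence and labelling (not taken from the datasets).
sequence = parse_fasta(">EXAMPLE\nMADVS\nTY\n")
labels = parse_labels("0000100\n")
# 1-based positions and residues of the labelled phosphosites.
phosphosites = [(i + 1, sequence[i]) for i, lab in enumerate(labels) if lab == 1]
```

Checking that the sequence and the label list have the same length before pairing them is a cheap way to detect a 
mismatch between this sketch's assumptions and the real file format.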


3, Tables of statistics

Table S1 lists the numbers of positive and negative sites in the training and testing datasets of the 5 kinase 
families reported in the manuscript. For the independent test, we did not down-sample the negative samples; instead, 
we restricted them to the residue types (S, T or Y) that the specific kinase phosphorylates. For AGC/PKA, AGC/PKC, 
CMGC/CDK and Other/CK2, we included all S and T sites that are not annotated as phosphosites as the negative samples. 
For TK/Src, we included all Y sites that are not annotated as phosphosites as the negative samples. Therefore, the 
numbers of negative samples in the testing datasets of AGC/PKA, AGC/PKC, CMGC/CDK, Other/CK2, and TK/Src in Table S1 
differ from those in Table S2, which include S, T, and Y sites that are not annotated as phosphosites.
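
To make the residue-type restriction concrete, the following minimal sketch counts positive and negative sites in one 
protein for a chosen set of residue types. It is an illustration of the rule as described in this section, not the 
script used to produce the tables; the function name and the toy sequence are hypothetical.

```python
def count_sites(sequence, labels, residues):
    """Count (positives, negatives) among the residues of the given types.

    sequence: amino acid string; labels: list of 0/1 per residue;
    residues: the residue types to consider, e.g. "ST", "Y" or "STY".
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    positives = negatives = 0
    for residue, label in zip(sequence, labels):
        if residue not in residues:
            continue  # residues of other types are neither positives nor negatives
        if label == 1:
            positives += 1
        else:
            negatives += 1
    return positives, negatives


# Toy example (made-up sequence and labelling).
toy_sequence = "MASTYKSY"
toy_labels = [0, 0, 1, 0, 0, 0, 0, 1]
st_counts = count_sites(toy_sequence, toy_labels, "ST")    # S/T-specific kinases
y_counts = count_sites(toy_sequence, toy_labels, "Y")      # Y-specific kinases
sty_counts = count_sites(toy_sequence, toy_labels, "STY")  # all three residue types
```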


Table S1. Statistical summary of the training and testing datasets for the five kinases used in this study.  

-----------------------------------------------------------------------------------------------------------------------------------------
			AGC/PKA			AGC/PKC			CMGC/CDK		Other/CK2		TK/Src
			---------------		---------------		---------------		---------------		---------------
			#pos	#neg		#pos	#neg		#pos	#neg		#pos	#neg		#pos	#neg
			---------------		---------------		---------------		---------------		---------------
Training		716	371,714		766 	253,022		486 	195,788		532 	129,589		496 	182,896
Testing			181	 15,288		196	 10,672		142	  7,443		136	  6,167		135	  8,246
-----------------------------------------------------------------------------------------------------------------------------------------


Table S2 lists the numbers of positive and negative sites in the training and testing datasets of the 138 kinase 
groups, families, subfamilies and protein kinases. In Table S2, each line corresponds to a kinase (including kinase 
groups, families, subfamilies and protein kinases) and gives the numbers of positive and negative samples in its 
training and testing datasets. The negative samples for each kinase include the S, T, and Y sites that are not 
annotated as phosphorylation sites. Each kinase is denoted by its path from the root ‘group’ in the hierarchical 
structure, giving the path format group/family/subfamily/kinase.
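
For readers who want to process Table S2 programmatically, the sketch below splits a kinase path into its levels and 
parses one table row. It is a minimal example under stated assumptions: the function names are hypothetical, rows are 
assumed to be whitespace-separated as in the table below, and a ‘-’ path component is assumed to be a placeholder for 
a level without a name. The ‘Lack of model’ column is only recorded as a flag and is not interpreted.

```python
LEVELS = ("group", "family", "subfamily", "kinase")


def parse_kinase_path(path):
    """Map a path such as AGC/PKC/Alpha/PKCA to its named levels.

    Levels beyond the end of the path are omitted; a '-' component becomes None
    (assumption: '-' marks a level without a name).
    """
    parts = path.split("/")
    if not 1 <= len(parts) <= len(LEVELS):
        raise ValueError(f"unexpected kinase path: {path!r}")
    return {level: (None if part == "-" else part) for level, part in zip(LEVELS, parts)}


def parse_row(line):
    """Parse one Table S2 row into a dictionary of integer counts."""
    fields = line.split()
    train_pos, train_neg, test_pos, test_neg = (
        int(value.replace(",", "")) for value in fields[1:5]
    )
    return {
        "kinase": fields[0],
        "train_pos": train_pos,
        "train_neg": train_neg,
        "test_pos": test_pos,
        "test_neg": test_neg,
        # True when the row carries an X in the 'Lack of model' column.
        "lack_of_model": len(fields) > 5 and fields[5] == "X",
    }


levels = parse_kinase_path("TK/Src/-/SRC")
row = parse_row("CAMK/CAMKL/MARK/MARK1\t\t\t18\t\t\t2,413\t\t\t9\t\t\t635\t\t\tX")
```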


Table S2. Statistical summary of the training and testing datasets for the other kinase groups, families, subfamilies, 
and protein kinases that were constructed from the Swiss-Prot/UniProt database and the Phospho.ELM database. 

---------------------------------------------------------------------------------------------------------------------------------------------
					Training 					Testing					Lack of model
---------------------------------------------------------------------------------------------------------------------------------------------
					#pos			#neg			#pos			#neg		 	X
					--------------------------------		-------------------------------		-------------
AGC					2090			124,479			518			31,953
AGC/DMPK				99			5,107			26			1,515
AGC/DMPK/ROCK				88			4,918			30			1,548
AGC/DMPK/ROCK/ROCK1			36			2,057			18			783
AGC/DMPK/ROCK/ROCK2			43			2,955			11			1,036
AGC/GRK					254			5,457			54			1,400
AGC/GRK/BARK				150			3,344			35			1,086
AGC/GRK/BARK/GRK2			131			2,846			42			909
AGC/GRK/GRK				108			2,237			25			465
AGC/GRK/GRK/GRK6			58			984			11			772
AGC/PKA					716			48,435			181			15,288
AGC/PKA/-				472			42,251			132			12,185
AGC/PKA/-/PKA				472			42,251			132			12,185
AGC/PKB					256			20,270			60			5,784
AGC/PKB/-				52			1,977			11			783
AGC/PKB/-/PDK1				52			1,977			11			783
AGC/PKC					766			34,120			196			10,672
AGC/PKC/Alpha				158			9,979			54			3,239
AGC/PKC/Alpha/PKCA			157			9,705			39			2,221
AGC/PKC/Delta				51			3,878			12			1,590
AGC/PKC/Delta/PKCD			51			3,878			12			1,590
AGC/PKC/Eta				41			1,861			9			1,028
AGC/PKC/Eta/PKCE			41			1,861			9			1,028
AGC/PKC/Iota				36			3,473			11			1,010
AGC/PKC/Iota/PKCZ			33			3,003			9			808
AGC/PKG					72			7,061			17			1,476
AGC/RSK					52			3,841			14			1,123
AGC/RSK/RSK				40			3,357			9			573
AGC/SGK					70			5,286			35			1,712
AGC/SGK/-				57			4,271			28			1,379
AGC/SGK/-/SGK1				61			4,024			18			1,017
Atypical				182			12,016			42			3,191
Atypical/PIKK				183			12,047			37			3,043
Atypical/PIKK/ATM			151			10,183			38			3,066
Atypical/PIKK/ATM/ATM			151			10,183			38			3,066
Atypical/PIKK/ATR			34			2,168			22			1,034
Atypical/PIKK/ATR/ATR			34			2,168			22			1,034
CAMK					778			50,519			171			13,955
CAMK/CAMK1				68			4,568			17			2,065
CAMK/CAMK1/-				28			2,120			9			848
CAMK/CAMK1/-/CAMK4			17			1,522			11			951
CAMK/CAMK2				128			10,067			34			4,546
CAMK/CAMK2/-				50			4,660			11			2,174
CAMK/CAMK2/-/CAMK2A			27			2,725			13			605
CAMK/CAMKL				316			28,251			82			6,757
CAMK/CAMKL/AMPK				168			16,562			49			4,132
CAMK/CAMKL/BRSK				25			1,607			8			593
CAMK/CAMKL/LKB				36			4,233			12			1,275
CAMK/CAMKL/LKB/LKB1			36			4,233			12			1,275
CAMK/CAMKL/MARK				35			3,650			12			1,391
CAMK/CAMKL/MARK/MARK1			18			2,413			9			635			X
CAMK/CAMKL/NuaK				24			1,692			10			925
CAMK/CAMKL/NuaK/NUAK1			24			1,692			10			925
CAMK/DAPK				44			2,274			9			310
CAMK/DAPK/-				44			2,251			8			288
CAMK/DAPK/-/DAPK1			28			1,833			9			262
CAMK/MAPKAPK				85			5,051			34			1,296
CAMK/MAPKAPK/MAPKAPK			95			4,660			21			1,497
CAMK/MAPKAPK/MAPKAPK/MAPKAPK2		72			3,416			18			1,036
CAMK/MAPKAPK/MAPKAPK/MAPKAPK5		23			1,205			14			755
CAMK/PHK				74			2,747			27			1,145
CAMK/PKD				50			3,283			11			890
CK1					229			11,667			70			2,944
CK1/CK1					214			11,924			58			1,969
CK1/VRK					29			1,554			14			660
CK1/VRK/-				29			1,554			14			660
CK1/VRK/-/VRK1				29			1,554			14			660
CK1/VRK/-/VRK2				24			1,239			8			433
CMGC					936			58,500			263			13,783
CMGC/CDK				486			29,319			142			7,443
CMGC/CDK/CDC2				215			14,529			47			3,713
CMGC/CDK/CDC2/CDK2			182			14,731			51			2,721
CMGC/CDK/CDK5				159			13,219			42			3,384
CMGC/CDK/CDK5/CDK5			159			13,219			42			3,384			X
CMGC/CDK/CDK7				37			1,979			7			648
CMGC/CDK/CDK7/CDK7			37			1,979			7			648
CMGC/CDK/CDK9				39			2,251			5			641
CMGC/CDK/CDK9/CDK9			39			2,251			5			641			X
CMGC/DYRK				144			9,503			37			2,775
CMGC/DYRK/Dyrk1				17			1,554			4			368
CMGC/DYRK/Dyrk2				80			6,529			20			1,677
CMGC/DYRK/Dyrk2/DYRK2			70			6,181			26			1,667
CMGC/DYRK/HIPK				60			2,779			17			597
CMGC/DYRK/HIPK/HIPK2			49			2,417			13			596
CMGC/GSK				212			15,101			59			3,372
CMGC/GSK/GSK3B				212			15,101			59			3,372
CMGC/MAPK				158			11,054			42			2,382
Other					894			43,866			237			11,192
Other/CK2				532			17,159			136			6,167
Other/CK2/-				100			5,697			22			1,197
Other/CK2/-/CK2A			100			5,697			22			1,197
Other/IKK				80			5,824			52			2,025
Other/IKK/-				83			6,173			22			1,436
Other/IKK/-/IKKB			38			1,907			14			1,117
Other/IKK/-/IKKE			18			1,162			7			360
Other/IKK/-/TBK1			25			1,567			10			479
Other/NEK				36			1,846			8			629
Other/NEK/-				36			1,846			8			629
Other/PLK				197			15,489			64			3,857
Other/PLK/-				197			15,489			64			3,857
Other/PLK/-/PLK1			108			8,932			30			1,980
Other/PLK/-/PLK3			51			3,565			17			960
STE					192			9,822			49			2,764
STE/STE20				121			6,672			27			1,372
STE/STE20/PAKA				99			5,072			24			1,448
STE/STE20/PAKA/PAK1			45			2,848			14			1,891
STE/STE20/PAKA/PAK2			52			1,644			14			861
STE/STE20/PAKA/PAK3			23			1,359			8			418
STE/STE7				46			1,657			20			666
STE/STE7/-				46			1,657			20			666
TK					1099			54,316			265			13,254
TK/Abl					66			4,154			10			783
TK/Abl/-				66			4,154			10			783
TK/Abl/-/ABL				66			4,154			10			783
TK/Csk					116			3,919			41			1,344
TK/Csk/-				116			3,919			41			1,344
TK/Csk/-/CSK				116			3,919			41			1,344
TK/EGFR					73			4,178			15			1,469
TK/EGFR/-				73			4,178			15			1,469
TK/EGFR/-/EGFR				73			4,178			15			1,469
TK/InsR					68			2,405			32			1,440
TK/InsR/-				68			2,405			32			1,440
TK/InsR/-/INSR				65			2,733			16			564
TK/JakA					82			5,147			14			1,126
TK/JakA/-				82			5,147			14			1,126
TK/JakA/-/JAK2				58			3,754			13			999
TK/PDGFR				43			2,549			15			1,237
TK/Src					496			4,935			135			8,246
TK/Src/-				494			29,678			131			7,779
TK/Src/-/FYN				79			6,170			32			1,671			X
TK/Src/-/LCK				53			2,836			21			1,440			X
TK/Src/-/LYN				70			3,465			11			1,432			X
TK/Src/-/SRC				298			17,987			83			5,299			
TK/Syk					74			3,172			20			831
TK/Syk/-				74			3,172			20			831
TK/Syk/-/SYK				62			2,380			17			1,140
TK/Tec					37			2,197			14			1,027
TK/Tec/-				37			2,197			14			1,027
---------------------------------------------------------------------------------------------------------------------------------------------
